LZW Data Compression
I saw an article in Hugi #15 about LZW compression... it was pretty brief and not for beginners. So I thought I'd write an article about LZW compression. LZW compression is really easy to implement, and you certainly don't need a huge knowledge of data compression to understand and code it.
Disclaimer
If I make a mistake, don't go getting all worked up about it.
If you are going to get all worked up, you shouldn't read this article.
If you read it anyway, I'm not liable.
An outline of LZW compression
In LZW compression we create a table of patterns from the input file, if we find that a pattern in the inputfile is repeated we simply output the index of that pattern in the pattern table. This table is not passed from compressor to decompressor but instead it is generated by both. So, an LZW compressed file consists of indices from a pattern table, nothing else is needed.
LZW Compression
Let me define a few simple terms before we start,
"C": The current character coming from the input file.
"Prefix": Anything from a few to many characters, this preceedes C, Prefix and C together make up a "Pattern".
"Pattern table": A table of patterns we build from the file we are compressing, and decompressing.
They are all the terms you need to know for this description of LZW compression, oh, and for all of you who care, LZW stands for "Lempel-Ziv Welch". Okay, let's start.
What we want to do in LZW compression is, like many other compression routines, try to find patterns that are repeated in the input file, we of course want to find the longest patterns which are repeated. In LZW we start by initialising our pattern table, usually you'd set the first 256 entries to the first 256 characters. However, if you are sure the file you are compressing will not contain any more than (any) N characters you would initialise only these N characters. Here are the steps you follow to preform LZW compression, I'll explain them afterward. (Remember: C stuck on the end of Prefix makes up a Pattern.)
<1> Initialisation
(i) Fill the pattern table with the first 256 characters.
(ii) Clear the prefix.
Loop until EOF.
<2> Get the next character from the input file.
<3> Is the current pattern in the pattern table?
- Yes:
(i) Prefix = Pattern.
- No : (i) Add the current pattern to the pattern table
(ii) Output prefix's index in the pattern table.
to the output file.
(iii) Prefix = C
<4> Go to step 2
That's all there is to it! Once you reach the EOF at step 3 just output prefix's index to the output file and you're finished.
Now, you might wonder about step 3, and why exactly we do everything we do in it, it's all very simple. Since we want to find the longest match, we must keep collecting information, each time we go round the loop we say, I know prefix is in the pattern table, but is prefix and the next character together? If so I'll have a longer match. If it is in the pattern table, we say, well what about the next character along with that (getting reeeal excited)! So we set prefix to itself and the next character, if it is not in the pattern table, we say, ok, so prefix without this character (the character we just read) is the longest match, I'll send its index in the pattern table, and I'll add it to the pattern table too, for future reference.
It's obvious we would be able to reconstruct this pattern table at decompression, you might even see how to do it already, but bare with me for a few implementation details.
In LZW, it is usual to start with a 9 bit pattern table, meaning that it can hold 512 strings, we'd only start with a 9 bit pattern table if we were using 256 characters to start with though. As we are constructing the pattern table, if we cannot represent a certain pattern's index in the current number of bits, we increase the number of bits (by one). It is usual to let it increase to 12 bits, if it needs to exceed this size we empty the pattern table, and start again. There is no reason why we shouldn't use more bits than 12 however (to the best of my knowledge).
Another important implementation detail to note is that we don't actually have to fill the pattern table with all the first 256 characters if we know we will be using all 256 characters, when we come across a pattern of length 1 we know it must be in the pattern table, so there's no need to search for it.
The decompressor contructs the table in an identical manner so it knows when to empty the table. That's all to say LZW Compression, how about Decompression though? Ok.
LZW Decompression
I don't think I need to say all that much about this, let's see... Remember we are trying to be exactly the same as the compressor here, except we are working backwards, look at this uncompressed file : and imagine our character set is 1 = A, 2 = B, 3 = C, 4 = D.
ABCABADABA
The compressor would compress that into (I seperate different indices with a '-')
1-2-3-5-1-4-6
Our compressor comes along and starts: First we get the first index (1), we notice it maps to 'A' so we output 'A' to the output file, we know at this point, when our compressor found a pattern already in the pattern table it set the prefix to it, so we have to do that too. Next we the index 2, and we find that it maps to 'B' so we output 'B' to the output file. Our compressor would have had prefix: 'A' at this point and C: 'B' at this point and checked for them in the pattern table, if it found them it would output "AB" 's index, if not it would have added them. We must add this pattern "AB", it would now have outputted 'A', and set prefix to 'B', because that was the newest character to form a pattern not in the pattern table. Now we get index 3, and see that it is 'C', so we output it. We add "BC" to the pattern table because the compressor would have done this and outputted the index for 'B'.
Next we see an index that is not in our original character set, prefix becomes 'C' at this point. We output the pattern at this index (it should be in the pattern table unless our decompressor doesn't work correctly) the pattern is "AB". Now we add "CA" to the pattern table, because when forming this pattern "AB" it would have added first "CA", outputted 'C', then prefix would become 'A', it would get 'B', find "AB" in the pattern table and output its index. Now prefix becomes "AB" because it the compressor would have done this, because when it finds a pattern already in the pattern table it sets prefix to it in order to find a longer match. Next we get an index corresponding to 'A', we output 'A' to the output file and add "ABA" to the pattern table. Now prefix becomes 'A', because the compressor would have done this (I've already explained why!). Now we get an index corresponding to 'D' and add "AD" to the pattern table. Then we come to a pattern that has an index outside our original character set, so we get the pattern corresponding to it in the pattern table and output it. Note at this we have to add the patterns to the table like we did when we found the index corresponding to "AB"...
Now we have, as our uncompressed file:
ABCABADABA it's right!
That's ALL there is to it.
Of course, we have to deal with things like, increasing the number of bits if our pattern table is full with a certain number, and hence reading more bits for each pattern. Also, we have to empty the table if it gets more than our max number of bits, all that is easy, and do it the same way you did in compression.
No, I'm not going to give pseudo-code, that (long) example explains everything already and I'm tired of typing.
Any Questions/Comments/Flames: